Claude Grasland & Etienne Toureille
19/10/2021
Previous analyses of German (left) and French (right) newspapers have demonstrated the value of analysing networks of states and world regions:
But before validating these results, we need:
The objective of this short note is to explore the potential of Wikidata for producing multilingual dictionaries of world regions and, more generally, of regional imaginations. To test the value of this approach, we will try to produce multilingual dictionaries for identifying different types of “regions” related to the division of the Earth (“natural”) or the division of the World (“political”):
“Earth/Natural regions”: i.e. regions that are used in atlases or dictionaries to locate places on the surface of the Earth and/or are taught to young students in textbooks as primary or secondary divisions of the surface of the Earth.
“World/Political regions”: i.e. groups of states that are linked by a formal international organization (e.g. European Union, ASEAN, …) and/or are perceived as politically or economically linked, either by themselves or in the eyes of external observers (e.g. BRICS, Eurasia).
So-called “physical maps” in atlases are a good source:
Textbooks and educational games for children are also crucial:
An attempt to classify intergovernmental organizations into four types:
Source : https://commons.wikimedia.org/wiki/Atlas_of_international_organizations
We propose to establish a dictionary of Earth and World regions in the five languages of interest for the IMAGEUN project:
We want to avoid any “eurocentric” or “anglocentric” perspective in the definition of entities. Therefore, our definition of entities will follow these rules:
To summarize, we propose to build partial equivalences between entities that belong to different lexical universes.
The comparison between lexical universes will necessarily be limited to a small sample of entities that we can assume to be approximately equivalent.
Wikidata defines itself as
The first benefit of Wikidata is that it provides unique identification codes for objects. For example, a search for “Africa” produces a list of different objects, each characterized by a unique code:
Once we have selected an entity (e.g. Q15), we obtain a new page with more detailed information in English but also in all the other languages available in Wikipedia.
A lot of information is available about the entity but, at this stage, the most important items for our research are:
Of course, we should not take the answers proposed by Wikidata for granted (as noted by Georg, Wikipedia is itself a subject of research for IMAGEUN …), but it undoubtedly offers a very good opportunity to clarify our questions and helps us build tools for recognizing world regions and other geographical imaginations from a multilingual perspective.
A Wikidata entity like Q15 is an element of an ontology designed by its authors for specific purposes. The specificity of the Wikidata ontology is that it is a multilingual web in which Q15 is a node present in different linguistic layers. This means that we do not have a single name or a single definition of Q15, unless we adopt the neocolonial perspective of choosing English as the reference language. Depending on the context (i.e. the language or sub-language), Q15 could be defined as:
| language | definition |
|---|---|
| fr | A continent named Afrique |
| en | A continent on the Earth’s northern and southern hemispheres named Africa or African continent |
| de | A “Kontinent auf der Nord- und Südhalbkugel der Erde” named “Afrika” |
| tr | A “Dünya nin kuzey ve güney yarikürelerindeki bir kita” named “Afrika” or “Afrika kitasi” |
| ar | The second largest continent in the world in terms of area and population, comes second only to Asia (trad.) |
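
The per-language view of Q15 can be sketched in a few lines of Python (used here instead of R only so that the example runs with the standard library). This is a minimal sketch, not the authors' code: it builds a request URL for the Wikidata `wbgetentities` API action and reads labels and descriptions out of a response. The helper names `entity_url` and `definitions` are hypothetical, and `SAMPLE_RESPONSE` is not live API output: it was typed by hand from the table above, following the general shape of a `wbgetentities` reply.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def entity_url(qid, languages):
    """Build a wbgetentities request URL (no network call is made here)."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions",
        "languages": "|".join(languages),
        "format": "json",
    }
    return API + "?" + urlencode(params)


def definitions(response, qid, languages):
    """Return one (language, label, description) tuple per requested language.

    A language with no label or description yields None for that field.
    """
    entity = response["entities"][qid]
    rows = []
    for lang in languages:
        label = entity.get("labels", {}).get(lang, {}).get("value")
        desc = entity.get("descriptions", {}).get(lang, {}).get("value")
        rows.append((lang, label, desc))
    return rows


# Hand-made sample copied from the table above (assumption: not fetched live).
# Arabic is left out of the sample only to show how a missing language appears.
SAMPLE_RESPONSE = {
    "entities": {
        "Q15": {
            "labels": {
                "fr": {"language": "fr", "value": "Afrique"},
                "en": {"language": "en", "value": "Africa"},
                "de": {"language": "de", "value": "Afrika"},
            },
            "descriptions": {
                "fr": {"language": "fr", "value": "continent"},
                "en": {"language": "en",
                       "value": "continent on the Earth's northern and southern hemispheres"},
                "de": {"language": "de",
                       "value": "Kontinent auf der Nord- und Südhalbkugel der Erde"},
            },
        }
    }
}

for row in definitions(SAMPLE_RESPONSE, "Q15", ["fr", "en", "de", "ar"]):
    print(row)
```

Keeping the result as one row per language, rather than picking a single reference label, mirrors the point made above: no language is treated as the default definition of the entity.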
Sharing the same Wikidata entity code offers no guarantee of concordance between the geographical objects found in news published in different languages or different countries. But, and this is the important point, it helps us to identify similarities and differences between sets of geographical entities that are more or less comparable in each language.
Ex. 1 Amazonie: In French, Amazonie is associated with the entity Q2841453, which is defined as a “région naturelle en Amérique du Sud”. But this entity apparently does not exist in Turkish. At the same time, French also offers an entity “Forêt amazonienne” (Q177567), defined as “forêt équatoriale située dans le bassin amazonien en Amérique du Sud”, which is present in Turkish. There is also a third entity, “bassin amazonien” (Q244451), which refers to the so-called régions naturelles based on river basins (Cholley 1939).
Ex. 2 Proche-Orient/Moyen-Orient/Pays du Golfe/Asie de l’Ouest: In French, we find four entities describing the complex geopolitical area located in the western part of Asia and the eastern part of the Mediterranean. But these entities are not necessarily used in all languages with the same frequency and can be complemented by other entities such as Western Asia. The entity Proche-Orient (Q48214), which is frequent in French, is only available in 30 languages, whereas Moyen-Orient (Q7204) is available in 84 languages, golfe Persique (Q34675) in 77 languages and Asie de l’Ouest (Q27293) in 71 languages.
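Gaps of this kind can be listed mechanically before any cross-language comparison. The sketch below is illustrative only: `missing_languages` is a hypothetical helper, and the labels in `LABELS` are copied by hand from the tables of this note (Q2841453 has no Turkish label there), not retrieved from Wikidata.

```python
def missing_languages(labels_by_entity, languages):
    """For each entity, list the requested languages that have no label."""
    gaps = {}
    for qid, labels in labels_by_entity.items():
        gaps[qid] = [lang for lang in languages if not labels.get(lang)]
    return gaps


# Labels copied from this note (assumption: a hand-made extract, not a live query).
LABELS = {
    "Q2841453": {"de": "Amazonien", "en": "Amazonia", "fr": "Amazonie"},
    "Q48214": {"de": "Naher Osten", "en": "Near East",
               "fr": "Proche-Orient", "tr": "Yakin Dogu"},
    "Q7204": {"de": "Mittlerer Osten", "en": "Middle East",
              "fr": "Moyen-Orient", "tr": "Orta Dogu"},
}

gaps = missing_languages(LABELS, ["de", "en", "fr", "tr"])
for qid, langs in gaps.items():
    print(qid, "missing:", langs if langs else "none")
```

An entity with a gap is not necessarily unusable: as in the Amazonie example, a neighbouring entity may play the same role in the language concerned, which is a decision for a human expert.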
Bearing in mind the limits of the equivalence of entities across languages, it can nevertheless be an interesting experiment to select a set of Wikidata entities (Q15, Q258, Q4412 …) and to examine their relative frequency in our media from different countries and in different languages. A typical hypothesis could be something like:
which is not equivalent to the question
but rather equivalent to the two joint questions
The WikidataR package is an interface to the Wikidata API for the R language. Equivalent tools are available in Python and other languages for those not familiar with R, and it is of course possible to use the API directly. The first step is to install the most recent version of the R package WikidataR, which also installs related packages of interest.
If we start our search with the word “Afrique” in French, we find more than 50 entities that contain this word in their label. Only the first 10 are presented below:
| item_id | item_label | item_desc | item_lang | item_text |
|---|---|---|---|---|
| Q15 | Africa | continent on the Earth’s northern and southern hemispheres | fr | Afrique |
| Q181238 | Africa | Roman province on the northern African coast covering parts of present-day Tunisia, Algeria, and Libya | fr | Afrique |
| Q203548 | African Plate | continental plate underlying Africa | fr | Afrique |
| Q258 | South Africa | sovereign state in Southern Africa | fr | Afrique du Sud |
| Q4412 | West Africa | region of Africa | fr | Afrique de l’Ouest |
| Q132959 | Sub-Saharan Africa | area of the continent of Africa that lies south of the Sahara Desert | fr | Afrique subsaharienne |
| Q27394 | Southern Africa | southernmost region of the African continent | fr | Afrique australe |
| Q27407 | East Africa | easterly region of the African continent | fr | Afrique de l’Est |
| Q27381 | North Africa | northernmost region of the African continent | fr | Afrique du Nord |
| Q2826196 | Afrique | Wikimedia disambiguation page | fr | Afrique |
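
For readers who prefer to call the API directly, a table like the one above can be approximated with the `wbsearchentities` action. The following Python sketch is an assumption about how such a table could be produced, not a description of what WikidataR does internally. The function names `search_url` and `to_rows` are hypothetical, the column names are borrowed from the table above, and `SAMPLE_SEARCH` is a hand-made extract of two rows rather than real API output.

```python
from urllib.parse import urlencode


def search_url(text, language, limit=10):
    """Build a wbsearchentities request URL (no network call is made here)."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,   # language in which the search term is matched
        "uselang": "en",        # language of the returned label and description
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def to_rows(response):
    """Flatten a search response into rows shaped like the table above."""
    rows = []
    for hit in response.get("search", []):
        match = hit.get("match", {})
        rows.append({
            "item_id": hit["id"],
            "item_label": hit.get("label"),
            "item_desc": hit.get("description"),
            "item_lang": match.get("language"),
            "item_text": match.get("text"),
        })
    return rows


# Two rows typed by hand from the table above (assumption: not fetched live).
SAMPLE_SEARCH = {
    "search": [
        {"id": "Q15", "label": "Africa",
         "description": "continent on the Earth's northern and southern hemispheres",
         "match": {"type": "label", "language": "fr", "text": "Afrique"}},
        {"id": "Q258", "label": "South Africa",
         "description": "sovereign state in Southern Africa",
         "match": {"type": "label", "language": "fr", "text": "Afrique du Sud"}},
    ]
}

for row in to_rows(SAMPLE_SEARCH):
    print(row["item_id"], row["item_text"], "->", row["item_label"])
```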
The analysis of the list of results reveals four situations:
Target entities: A first list is related to entities that can be considered as world regions or geographical imaginations of interest for IMAGEUN. This is typically the case for the whole continent of Africa (Q15) and its different subdivisions such as North Africa (Q27381), West Africa (Q4412) and Sub-Saharan Africa (Q132959).
Control entities: A list of entities that are not regions but must be controlled for if we want to identify our target entities. A typical example is the sovereign state of South Africa (Q258), which will necessarily introduce mistakes in the identification of Africa as a continent if it is not controlled for. The problem will not necessarily exist in all languages (e.g. German) but is important.
Ambiguous entities: Some entities are ambiguous because they are not regions but use exactly the same textual units as a target entity. This is the case, for example, of the Roman province of Africa (Q181238), which cannot easily be differentiated from the continent, except by manual inspection. These units are not easy to control for but, fortunately, are generally not frequent.
Insignificant entities: Entities that are exceptional in the corpus can simply be ignored.
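The role of control entities can be illustrated with a small tagging sketch in Python. It is one possible approach under stated assumptions, not the method used in the project: labels are matched longest first, so that “Afrique du Sud” is consumed as a control entity before the shorter “Afrique” can be matched as the continent. The function `tag_regions` and the two dictionaries are hypothetical and contain only a few French labels taken from the table above. Ambiguous entities such as the Roman province share the exact label of the continent and are not handled here; they still require manual inspection.

```python
import re

# Small illustrative dictionaries (assumption: a hand-picked subset of labels).
TARGETS = {
    "Afrique": "Q15",
    "Afrique du Nord": "Q27381",
    "Afrique subsaharienne": "Q132959",
}
CONTROLS = {
    "Afrique du Sud": "Q258",  # sovereign state, not a world region
}


def tag_regions(text, targets, controls):
    """Return the codes of target entities found in text, in reading order.

    Control labels are matched too, but only to stop their words from being
    counted as a target entity; they are not reported.
    """
    labels = list(targets) + list(controls)
    # Longest labels first: the regex engine tries alternatives left to right.
    labels.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(l) for l in labels) + r")\b")
    found = []
    for match in pattern.finditer(text):
        label = match.group(1)
        if label in targets:
            found.append(targets[label])
    return found


print(tag_regions("Asie, Afrique, Europe", TARGETS, CONTROLS))
print(tag_regions("L'Afrique du Sud et l'Afrique du Nord", TARGETS, CONTROLS))
```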
We propose a semi-automatic method for extracting entities in different languages, which involves a human expert at each step of the analysis. The figure below describes an example search for world regions related to Africa in three languages.
The programs used for the implementation are explained in the media cookbook on GitHub, with an example implementation available on the following page
We have tested the previous workflow on an arbitrary selection of entities, which are mainly related to continents and “natural” Earth divisions:
This quick and dirty analysis does not offer any guarantee of quality because:
The purpose is therefore only to provide food for thought.
We start from a corpus of texts in which the target Wikidata entities have been recognized:
| text | source | date | regs | nbregs |
|---|---|---|---|---|
| Asie, Afrique, Europe: la nouvelle stratégie de l’État islamique | fr_FRA_figaro | 2019-05-03 | Q48 Q15 Q46 | 3 |
| ‘Rolling emergency’ of locust swarms decimating Africa, Asia and Middle East | en_GBR_guardi | 2020-06-08 | Q15 Q48 Q7204 | 3 |
| Coronavirus pushes beyond Asia as it takes aim at Europe and Middle East | en_NIR_beltel | 2020-02-24 | Q48 Q46 Q7204 | 3 |
| Solar eclipse wows stargazers in Africa, Asia and the Middle East | en_NIR_beltel | 2020-06-21 | Q15 Q48 Q7204 | 3 |
| Avrasya Tüneli Avrupa Anadolu geçisi bir saat trafige kapatildi: Uzun araç kuyruklari olustu | tr_TUR_yenisa | 2020-05-28 | Q5401 Q46 Q12824780 | 3 |
| «L’émigration permanente vers l’Europe prive l’Afrique de ses jeunes les plus brillants» | fr_FRA_figaro | 2019-02-15 | Q46 Q15 | 2 |
| L’Amérique et l’Europe frappent la Syrie au portefeuille | fr_FRA_figaro | 2019-02-26 | Q828 Q46 | 2 |
| Sommet Trump-Kim à Hanoï: quels enjeux pour les alliés des États-Unis en Europe et en Asie? | fr_FRA_figaro | 2019-02-27 | Q46 Q48 | 2 |
| Des migrants venus d’Asie traversent la Méditerranée | fr_FRA_figaro | 2019-05-03 | Q48 Q4918 | 2 |
| Sept pays d’Amérique du Sud en sommet pour défendre l’Amazonie | fr_FRA_figaro | 2019-09-06 | Q18 Q2841453 | 2 |
For experiment 2, we create a new object called a hypercube, in which the text of the news has been removed and we keep only the number of tags or the proportion of news mentioning one or several regions (where1, where2), by media (who) and by time period (when).
| who | when | where1 | where2 | tags | news |
|---|---|---|---|---|---|
| fr_FRA_figaro | 2019-01-01 | Q46 | Q15 | 2 | 0.3611111 |
| fr_FRA_figaro | 2020-01-01 | Q46 | Q15 | 2 | 0.5000000 |
| fr_FRA_figaro | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| de_DEU_frankf | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| de_DEU_suddeu | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| en_GBR_telegr | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| en_IRL_irtime | 2019-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| en_IRL_irtime | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| en_IRL_irtime | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| tr_TUR_cumhur | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| tr_TUR_yenisa | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| ar_TUN_babnet | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000 |
| fr_TUN_ecomag | 2019-01-01 | Q46 | Q15 | 1 | 0.2500000 |
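
The aggregation into a hypercube can be sketched as follows in Python. The weighting is an assumption reverse-engineered from the values shown, not a documented rule: each news item with n tagged regions contributes one tag and a weight of 1/n² to every ordered pair of its regions (self-pairs included), and dates are truncated to the first day of the year. With the Figaro headlines listed earlier, this reproduces the first row above (2 tags, 0.3611111 = 1/9 + 1/4), but other weightings could also fit. The function name `build_hypercube` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_hypercube(news):
    """Aggregate tagged news into cells keyed by (who, when, where1, where2)."""
    cube = defaultdict(lambda: {"tags": 0, "news": 0.0})
    for item in news:
        regs = item["regs"].split()
        if not regs:
            continue
        when = item["date"][:4] + "-01-01"   # assumption: yearly time periods
        weight = 1.0 / len(regs) ** 2        # assumption: 1/n² per ordered pair
        for where1, where2 in product(regs, repeat=2):
            cell = cube[(item["source"], when, where1, where2)]
            cell["tags"] += 1
            cell["news"] += weight
    return dict(cube)


# Four Figaro items from the corpus table of this note, without their text.
NEWS = [
    {"source": "fr_FRA_figaro", "date": "2019-05-03", "regs": "Q48 Q15 Q46"},
    {"source": "fr_FRA_figaro", "date": "2019-02-15", "regs": "Q46 Q15"},
    {"source": "fr_FRA_figaro", "date": "2019-02-26", "regs": "Q828 Q46"},
    {"source": "fr_FRA_figaro", "date": "2019-02-27", "regs": "Q46 Q48"},
]

cube = build_hypercube(NEWS)
cell = cube[("fr_FRA_figaro", "2019-01-01", "Q46", "Q15")]
print(cell["tags"], round(cell["news"], 7))
```

Under this weighting, the weights contributed by a single news item sum to 1 across all its cells, so summing the `news` column over a newspaper and period gives back the number of tagged news items.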
We can first propose a table of the top entities in the whole corpus of newspapers.
| rank | id | de | en | fr | tr | nb |
|---|---|---|---|---|---|---|
| 1 | Q46 | Europa | Europe | Europe | Avrupa | 4546 |
| 2 | Q15 | Afrika | Africa | Afrique | Afrika | 1022 |
| 3 | Q4918 | Mittelmeer | Mediterranean Sea | mer Méditerranée | Akdeniz | 912 |
| 4 | Q7204 | Mittlerer Osten | Middle East | Moyen-Orient | Orta Dogu | 332 |
| 5 | Q48 | Asien | Asia | Asie | Asya | 293 |
| 6 | Q66065 | Sahelzone | Sahel | Sahel | Sahel | 240 |
| 7 | Q98 | Pazifischer Ozean | Pacific Ocean | océan Pacifique | Büyük Okyanus | 200 |
| 8 | Q25322 | Arktis | Arctic | Arctique | Arktika | 180 |
| 9 | Q97 | Atlantischer Ozean | Atlantic Ocean | océan Atlantique | Atlas Okyanusu | 180 |
| 10 | Q1286 | Alpen | Alps | Alpes | Alpler | 174 |
| 11 | Q28227 | Maghreb | Maghreb | Maghreb | Magrip | 136 |
| 12 | Q6583 | Sahara | Sahara | Sahara | Sahra | 122 |
| 13 | Q12585 | Lateinamerika | Latin America | Amérique latine | Latin Amerika | 122 |
| 14 | Q664609 | Karibik | Caribbean | Caraïbes | Karayipler | 110 |
| 15 | Q51 | Antarktika | Antarctica | Antarctique | Antarktika | 105 |
| 16 | Q2841453 | Amazonien | Amazonia | Amazonie | NA | 104 |
| 17 | Q48214 | Naher Osten | Near East | Proche-Orient | Yakin Dogu | 84 |
| 18 | Q35942 | Polynesien | Polynesia | Polynésie | Polinezya | 82 |
| 19 | Q18 | Südamerika | South America | Amérique du Sud | Güney Amerika | 74 |
| 20 | Q23522 | Balkanhalbinsel | Balkans | Balkans | Balkanlar | 71 |
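
A ranking of this kind reduces to counting entity codes across the `regs` column and attaching the label of each language. The Python sketch below shows the idea on four headlines taken from the corpus table; `top_entities` and the small `LABELS_BY_LANG` dictionary are hypothetical, and the counts it prints concern this tiny sample only, not the corpus figures above.

```python
from collections import Counter


def top_entities(regs_column, labels, n=20):
    """Count entity codes over all news and return the n most frequent.

    Each result is (code, labels of that code by language, count).
    """
    counts = Counter(code for regs in regs_column for code in regs.split())
    return [(code, labels.get(code, {}), nb) for code, nb in counts.most_common(n)]


# Labels copied from the table above for three codes (assumption: partial dictionary).
LABELS_BY_LANG = {
    "Q46": {"de": "Europa", "en": "Europe", "fr": "Europe", "tr": "Avrupa"},
    "Q15": {"de": "Afrika", "en": "Africa", "fr": "Afrique", "tr": "Afrika"},
    "Q48": {"de": "Asien", "en": "Asia", "fr": "Asie", "tr": "Asya"},
}

# regs values of four headlines from the corpus table of this note.
REGS = ["Q48 Q15 Q46", "Q15 Q48 Q7204", "Q46 Q15", "Q48 Q4918"]

for code, names, nb in top_entities(REGS, LABELS_BY_LANG, n=3):
    print(code, names.get("en", "NA"), nb)
```

Codes without an entry in the label dictionary are kept with an empty label set, which corresponds to the NA shown for Amazonie in Turkish.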
| rank | Cumhuriyet region | Cumhuriyet pct | Yeni Savas region | Yeni Savas pct |
|---|---|---|---|---|
| 1 | Avrupa | 64.5 | Avrupa | 53.4 |
| 2 | Akdeniz | 16.6 | Akdeniz | 24.1 |
| 3 | Afrika | 4.2 | Afrika | 7.1 |
| 4 | Asya | 2.9 | Asya | 3.0 |
| 5 | Avrasya | 2.4 | Avrasya | 2.4 |
| 6 | Antarktika | 1.5 | Sahra | 1.5 |
| 7 | Sahra | 1.2 | Orta Dogu | 1.5 |
| 8 | Orta Dogu | 1.1 | Antarktika | 1.3 |
| 9 | Orta Asya | 0.7 | Kafkasya | 1.1 |
| 10 | Latin Amerika | 0.6 | Basra Körfezi | 0.9 |
| rank | FAZ region | FAZ pct | Süd. Zeit. region | Süd. Zeit. pct |
|---|---|---|---|---|
| 1 | Europa | 57.4 | Europa | 49.4 |
| 2 | Afrika | 6.8 | Afrika | 8.3 |
| 3 | Mittelmeer | 5.0 | Mittlerer Osten | 8.1 |
| 4 | Asien | 4.2 | Mittelmeer | 7.4 |
| 5 | Alpen | 3.4 | Alpen | 4.8 |
| 6 | Mittlerer Osten | 2.4 | Naher Osten | 2.5 |
| 7 | Osteuropa | 2.0 | Südamerika | 2.4 |
| 8 | Balkanhalbinsel | 1.9 | Lateinamerika | 1.7 |
| 9 | Südamerika | 1.4 | Arktis | 1.7 |
| 10 | Südostasien | 1.4 | Asien | 1.5 |
| rank | Figaro region | Figaro pct | Le Monde region | Le Monde pct |
|---|---|---|---|---|
| 1 | Europe | 39.5 | Europe | 27.2 |
| 2 | mer Méditerranée | 8.2 | Afrique | 18.5 |
| 3 | Afrique | 6.0 | Sahel | 9.9 |
| 4 | Sahel | 4.9 | mer Méditerranée | 8.6 |
| 5 | Amazonie | 4.6 | Proche-Orient | 4.2 |
| 6 | Alpes | 3.7 | Moyen-Orient | 3.8 |
| 7 | Polynésie | 3.4 | Amazonie | 2.5 |
| 8 | Moyen-Orient | 2.9 | Alpes | 2.5 |
| 9 | océan Pacifique | 2.5 | Polynésie | 2.5 |
| 10 | Asie | 2.2 | Sahara | 2.1 |
| rank | Guardian region | Guardian pct | Daily Telegraph region | Daily Telegraph pct |
|---|---|---|---|---|
| 1 | Europe | 36.6 | Europe | 50.8 |
| 2 | Africa | 9.6 | Africa | 12.6 |
| 3 | Arctic | 7.2 | Asia | 4.8 |
| 4 | Pacific Ocean | 7.0 | Caribbean | 4.0 |
| 5 | Middle East | 6.7 | Pacific Ocean | 3.3 |
| 6 | Atlantic Ocean | 4.4 | Middle East | 2.8 |
| 7 | Asia | 3.0 | Southeast Asia | 2.6 |
| 8 | Latin America | 2.9 | Arctic | 2.6 |
| 9 | Antarctica | 2.6 | South China Sea | 1.8 |
| 10 | Caribbean | 2.4 | Atlantic Ocean | 1.8 |
| rank | Irish Times region | Irish Times pct | Belfast Telegraph region | Belfast Telegraph pct |
|---|---|---|---|---|
| 1 | Europe | 56.5 | Europe | 51.3 |
| 2 | Atlantic Ocean | 4.5 | Africa | 7.4 |
| 3 | Africa | 4.4 | Atlantic Ocean | 7.3 |
| 4 | Asia | 3.9 | Arctic | 5.7 |
| 5 | Pacific Ocean | 3.8 | Middle East | 4.6 |
| 6 | Middle East | 3.7 | Asia | 4.5 |
| 7 | Caribbean | 2.5 | Caribbean | 3.4 |
| 8 | Maghreb | 1.9 | Pacific Ocean | 3.1 |
| 9 | Alps | 1.8 | South America | 1.3 |
| 10 | Latin America | 1.6 | Central America | 1.3 |
Because the number of news items is limited, only the top 5 regions are presented for each newspaper. The newspaper Babnet is published in Arabic.
| rank | Babnet (ar) region | Babnet (ar) pct | Econ. Mag region | Econ. Mag pct | La Presse region | La Presse pct | Réalités region | Réalités pct |
|---|---|---|---|---|---|---|---|---|
| 1.0 | Afrique | 46.3 | Afrique | 34.3 | mer Méditerranée | 30.8 | Afrique | 40.0 |
| 2.0 | Europe | 22.7 | Maghreb | 18.6 | Afrique | 30.2 | Europe | 14.2 |
| 3.5 | Maghreb | 6.8 | mer Méditerranée | 14.7 | Sahel | 15.1 | Maghreb | 11.0 |
| 3.5 | Sahel | 6.8 | Europe | 11.9 | Maghreb | 6.4 | Sahel | 9.7 |
| 5.0 | Moyen-Orient | 5.5 | Afrique du Nord | 5.8 | Europe | 4.7 | mer Méditerranée | 9.0 |
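
The fractional values in the first column (3.5 twice) look like average ranks for tied percentages, which is the default behaviour of R's `rank()` function; this reading is an interpretation, not something stated in the note. The Python sketch below applies that rule to the Babnet column: Maghreb and Sahel both have 6.8, so they share the mean of ranks 3 and 4. The function name `average_ranks` is hypothetical.

```python
def average_ranks(pct):
    """Rank regions by decreasing percentage, giving tied values their mean rank."""
    ordered = sorted(pct.items(), key=lambda kv: -kv[1])
    rows = []
    i = 0
    while i < len(ordered):
        # Extend j to the last position holding the same percentage as position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for name, value in ordered[i:j + 1]:
            rows.append((mean_rank, name, value))
        i = j + 1
    return rows


# Percentages of the Babnet column, copied from the table above.
BABNET = {"Afrique": 46.3, "Europe": 22.7, "Maghreb": 6.8,
          "Sahel": 6.8, "Moyen-Orient": 5.5}

for rank, name, value in average_ranks(BABNET):
    print(rank, name, value)
```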